---
layout: post
title: "Welcome to Jekyll!"
date: 2024-04-02 17:59:06 +0200
categories: jekyll update
---
```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import calplot

df = pd.read_csv("data.csv")
```
However, in this sub-assignment, the focus will just be on the following 14 categories of crimes:
```python
focuscrimes = set(['WEAPON LAWS', 'PROSTITUTION', 'DRIVING UNDER THE INFLUENCE', 'ROBBERY',
                   'BURGLARY', 'ASSAULT', 'DRUNKENNESS', 'DRUG/NARCOTIC', 'TRESPASS',
                   'LARCENY/THEFT', 'VANDALISM', 'VEHICLE THEFT', 'STOLEN PROPERTY',
                   'DISORDERLY CONDUCT'])

f = df[df['Category'].isin(focuscrimes)].copy()
```
As the assignment deals with temporal data, the 'Date' and 'Time' columns are converted to datetime. The month and hour components are then extracted into two new columns, 'Month' and 'Hour':
```python
f['Month'] = pd.to_datetime(f['Date']).dt.month
f['Hour'] = pd.to_datetime(f['Time'], format='%H:%M').dt.hour
```
## Weekly Patterns
```python
plt.figure(figsize=(15, 25))
weekdays_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

for i, crime in enumerate(focuscrimes, start=1):
    ax = plt.subplot(7, 2, i)
    f_temp = f[f['Category'] == crime]
    counts = f_temp['DayOfWeek'].value_counts().reindex(weekdays_order)
    for j, day in enumerate(weekdays_order):
        ax.bar(j, counts.loc[day], color='lavender', edgecolor='indigo', linewidth=2, width=0.6)
    x = np.arange(len(weekdays_order))
    ax.set_xticks(x)
    ax.set_xticklabels(weekdays_order, rotation=90)
    ax.set_ylabel('Crime count')
    # Leave headroom above the tallest bar for the category label
    y_max = counts.max()
    ax.set_ylim(top=y_max * 1.3)
    ax.text(0.02, 0.95, f'{crime}', transform=ax.transAxes,
            horizontalalignment='left', verticalalignment='top', fontsize=14)
    # Only the bottom row of subplots keeps its x-axis labels
    if i > 12:
        ax.set_xlabel('Day of the Week')
    else:
        ax.set_xlabel('')
        ax.set_xticklabels([])

plt.tight_layout(pad=3.0)
plt.show()
```
The plots above show the weekday distribution of the focus crimes. Note that the y-axis scale varies between crime categories, because the total number of incidents differs considerably from category to category. Scaling each y-axis individually gives a clearer view of the weekday trends and the relative changes within each category. The same applies to the monthly and hourly trends plotted later in the report.
For most categories there is no significant relative change in the number of reported incidents across weekdays. Most categories exhibit a consistent trend, with slight increases or decreases during the weekend, which aligns with intuition for the category in question. However, some interesting observations warrant further attention:
## Monthly Patterns
```python
plt.figure(figsize=(16, 28))
months_order = list(range(1, 13))
months_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

for i, crime in enumerate(focuscrimes, start=1):
    ax = plt.subplot(7, 2, i)
    f_temp = f[f['Category'] == crime]
    counts = f_temp['Month'].value_counts().reindex(months_order)
    for j, month in enumerate(months_order):
        ax.bar(j, counts.loc[month], color='lavender', edgecolor='indigo', linewidth=1, width=0.5)
    x = np.arange(len(months_order))
    ax.set_xticks(x)
    ax.set_xticklabels(months_labels, rotation=0)
    ax.set_ylabel('Crime count')
    y_max = counts.max()
    ax.set_ylim(top=y_max * 1.3)
    ax.text(0.02, 0.95, f'{crime}', transform=ax.transAxes,
            horizontalalignment='left', verticalalignment='top', fontsize=12)
    # Only the bottom row of subplots keeps its x-axis labels
    if i > 12:
        ax.set_xlabel('Month of the Year')
    else:
        ax.set_xlabel('')
        ax.set_xticklabels([])

plt.tight_layout(pad=4.0)
plt.show()
```
The plots above represent the monthly distribution of focus crimes. No consistent pattern can be observed across the different categories. None of the specific categories stand out prominently. February appears lower in all categories except for 'driving under the influence,' but this is likely due to the fewer days in February compared to other months. The variation in the number of days (31 or 30) in alternating months could account for the slight fluctuations rather than a more consistent trend.
'Prostitution' and 'driving under the influence' seem to exhibit relatively lower incidence rates during the summer months. It was expected that there would be more significant differences in the reported crime trends during different seasons of the year. However, San Francisco does not experience four distinct seasons like Denmark, which may explain this observation.
(As mentioned earlier, the zoom level on the y-axis varies for different crime categories)
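The February effect mentioned above can be checked by normalizing the monthly totals by the number of days in each month. Below is a minimal sketch of that normalization; the counts are hypothetical, not taken from the dataset:

```python
import calendar
import pandas as pd

# Hypothetical monthly incident counts for one category (illustrative numbers only)
counts = pd.Series([310, 280, 315, 300, 312, 305, 318, 320, 301, 311, 299, 308],
                   index=range(1, 13))

# Days per month for a non-leap year; averaging over 2003-2018 would be more exact
days_in_month = pd.Series({m: calendar.monthrange(2015, m)[1] for m in range(1, 13)})

# Incidents per day removes the 28- vs 31-day artifact
per_day = counts / days_in_month
print(per_day.round(2))
```

If February's per-day rate is in line with the other months, its lower raw total is just the month-length artifact.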
## Hourly (24h) Patterns
```python
plt.figure(figsize=(16, 28))
hours_order = list(range(0, 24))
# Label each bar with its actual hour (0-23); labelling 1-24 would shift everything by one
hours_labels = hours_order

for i, crime in enumerate(focuscrimes, start=1):
    ax = plt.subplot(7, 2, i)
    f_temp = f[f['Category'] == crime]
    counts = f_temp['Hour'].value_counts().reindex(hours_order)
    for j, hour in enumerate(hours_order):
        ax.bar(j, counts.loc[hour], color='lavender', edgecolor='indigo', linewidth=1, width=0.5)
    x = np.arange(len(hours_order))
    ax.set_xticks(x)
    ax.set_xticklabels(hours_labels, rotation=0)
    ax.set_ylabel('Crime count')
    y_max = counts.max()
    ax.set_ylim(top=y_max * 1.3)
    ax.text(0.02, 0.95, f'{crime}', transform=ax.transAxes,
            horizontalalignment='left', verticalalignment='top', fontsize=12)
    # Only the bottom row of subplots keeps its x-axis labels
    if i > 12:
        ax.set_xlabel('Hour of the day')
    else:
        ax.set_xlabel('')
        ax.set_xticklabels([])

plt.tight_layout(pad=4.0)
plt.show()
```
The plots above represent the hourly distribution of focus crimes within a day. All categories show a peak in the early hours, which aligns with the understanding that crimes often occur during the night. As daylight emerges, the number of reported incidents decreases for all categories before increasing again later in the day. However, several noteworthy observations merit attention:
(As mentioned earlier, the zoom level on the y-axis varies for different crime categories)
## Hourly Patterns over a Week
```python
day_to_num = {'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3,
              'Friday': 4, 'Saturday': 5, 'Sunday': 6}
# Hour-of-week index: 0 = Monday 00:00, ..., 167 = Sunday 23:00
f['WeekHour'] = f['DayOfWeek'].map(day_to_num) * 24 + f['Hour']

plt.figure(figsize=(16, 28))
week_hours_order = list(range(0, 168))
weekdays_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

for i, crime in enumerate(focuscrimes, start=1):
    ax = plt.subplot(7, 2, i)
    f_temp = f[f['Category'] == crime]
    counts = f_temp['WeekHour'].value_counts().reindex(week_hours_order)
    for j, hour in enumerate(week_hours_order):
        ax.bar(j, counts.loc[hour], color='mediumpurple', edgecolor='indigo', linewidth=0.08, width=0.5)
    x = list(range(0, 168, 24))
    ax.set_xticks(x)
    ax.set_xticklabels(weekdays_order, rotation=0)
    ax.set_ylabel('Crime count')
    y_max = counts.max()
    ax.set_ylim(top=y_max * 1.3)
    ax.text(0.02, 0.95, f'{crime}', transform=ax.transAxes,
            horizontalalignment='left', verticalalignment='top', fontsize=12)
    # Dotted vertical lines mark day boundaries
    for day in range(1, 7):
        ax.axvline(day * 24, color='silver', linestyle=':')
    # Only the bottom row of subplots keeps its x-axis labels
    if i > 12:
        ax.set_xlabel('Hour of the Week')
    else:
        ax.set_xlabel('')
        ax.set_xticklabels([])

plt.tight_layout(pad=4.0)
plt.savefig('168HourlyPatterns.png', dpi=300)
plt.show()
```
The plots above represent the hourly distribution of focus crimes within a week. They exhibit consistent hourly trends, albeit with relatively higher peaks on certain days, which aligns with the analysis of the weekday plots.
(As mentioned earlier, the zoom level on the y-axis varies for different crime categories)
```python
p = f[f['Category'] == 'PROSTITUTION'].copy()
```

Converting the 'Date' column to datetime:

```python
p['Date'] = pd.to_datetime(p['Date'])
```

Grouping the incidents by date to get the number of incidents per day for the calendar plot:

```python
p_counts = p.groupby(pd.Grouper(key='Date', freq='D')).size()

title = 'Calendar plot of the frequency of Prostitution Incidents from 2003-2018'
calplot.calplot(p_counts, suptitle=title, cmap='YlGn', tight_layout=True)
plt.tight_layout(pad=4.0)
plt.savefig('CalPlot YlGn.png', dpi=300)
```
The chosen crime category is 'Prostitution', as it stood out in the observations above. The calendar plot shows the frequency of reported incidents for each day from January 2003 to mid-May 2018. The differences in reported frequency across years are quite interesting: the greener the cell, the more frequent the incidents. The years 2003, 2007, and 2008 had the most reported incidents, while from 2012 to 2014 there were almost none. In 2014, 2016, and 2017 there are only a few green days, but those days are very green. This suggests that when prostitution was reported on a given day, there were either many cases or none at all. Perhaps the perpetrators became better at hiding the prostitution rings, making them harder to detect, but once detected, a detection usually involved a significant operation. The drop in daily reported incidents after 2011 could be due to a new law or restriction in San Francisco.
```python
dui = f[f['Category'] == 'DRIVING UNDER THE INFLUENCE'].copy()
```

Total number of incidents of driving under the influence in each hour:

```python
dui_counts = dui.groupby('Hour').size()
```

Creating the 24-hour polar bar chart; in polar coordinates the bar heights become the radii:

```python
theta = np.linspace(0, 2 * np.pi, 24, endpoint=False)
radii = dui_counts.values

fig, ax = plt.subplots(figsize=(8, 8), subplot_kw={'projection': 'polar'})
bars = ax.bar(theta, radii, width=2 * np.pi / 24.5, color='aliceblue',
              edgecolor='steelblue', align='edge')
ax.set_theta_offset(np.pi / 2)   # start at 12 o'clock
ax.set_theta_direction(-1)       # run clockwise, like a clock face
ax.set_xticks(theta)
ax.set_xticklabels([f'{i}:00' for i in range(24)])
plt.title('24-Hour Pattern of Driving Under the Influence Incidents')
plt.savefig('PolarPlot.png', dpi=300)
plt.show()
```
The polar plot above represents the 24-hour pattern of 'Driving under the influence' crime incidents. It is evident that the reported incidents are lowest from 4:00 to 16:00, with approximately 100 or fewer cases, then gradually start increasing to approximately 300 cases by 20:00, and up to 650-700 incidents until 3:00 the following morning with a peak from midnight to 1 am.
A polar bar chart clearly visualizes the cyclical nature of these incidents throughout the day. However, it is not ideal for comparing the frequency of incidents between different time periods, as the circular layout makes it difficult to accurately assess differences in magnitude.
I used LLMs to learn how to improve my plots for better visualisation, and to explore and experiment with different features of the plotting functions.
Have a look at the Data Viz intro lecture (Week 2), then answer the following questions in your own words.
What is the difference between data and metadata? How does that relate to the GPS tracks-example?
Data refers to raw facts and figures, such as geographical coordinates in GPS tracking. Metadata is data about data, detailing the context and characteristics of the primary data, like the device used for GPS tracking and the time of recording. In GPS tracks, data gives the essential information (locations and timestamps), while metadata provides context (device type, activity purpose), enhancing data understanding and management.
Sune says that the human eye is a great tool for data analysis. Do you agree? Explain why/why not.
The human eye is indeed a powerful tool for data analysis, especially for identifying patterns, trends, and outliers in visual data representations like graphs and charts. It excels at spotting anomalies and irregularities that automated analysis tools might overlook, thanks to our innate ability to process visual information and recognize patterns.
- Can you think of something that is difficult for the human eye to process visually. Explain why your example is difficult.
The human eye has limitations, particularly when dealing with large datasets or complex, high-dimensional data. It struggles with accurately quantifying or comparing subtle differences in magnitude or identifying patterns in data beyond three dimensions.
For example, while the eye can easily discern the general trend in a scatter plot, it may find it difficult to accurately determine the correlation coefficient between variables just by looking. This difficulty arises because the human visual system is not equipped to process numerical precision or multi-dimensional relationships in the same way a statistical analysis can.
Another example is the polar bar chart analysed in Part 1 of this assignment. As mentioned, the circular layout can make it difficult for the human eye to accurately assess differences in magnitude.
Younger Patients:
- Medication A effective for 90 out of 100 patients (90% effectiveness).
- Medication B effective for 30 out of 50 patients (60% effectiveness).
Older Patients:
- Medication A effective for 20 out of 100 patients (20% effectiveness).
- Medication B effective for 40 out of 50 patients (80% effectiveness).
Analysis by Age Group: A is more effective for younger patients, while B is more effective for older patients.
Combined Results:
- A was effective for 110 out of 200 patients (55% effectiveness).
- B was effective for 70 out of 100 patients (70% effectiveness).
The combined results misleadingly suggest that B is overall more effective than A, even though A was more effective in the younger group and B was more effective in the older group.
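This reversal is an instance of Simpson's paradox, and it can be verified with a few lines of arithmetic using exactly the numbers from the example above:

```python
# Effectiveness data from the example: (effective, total) per age group and medication
groups = {
    "younger": {"A": (90, 100), "B": (30, 50)},
    "older":   {"A": (20, 100), "B": (40, 50)},
}

def rate(effective, total):
    return effective / total

# Within each age group, compare A and B
for age, meds in groups.items():
    print(age, "A:", rate(*meds["A"]), "B:", rate(*meds["B"]))

# Pooled over both age groups, the ordering flips
a_total = rate(90 + 20, 100 + 100)   # 110/200 = 0.55
b_total = rate(30 + 40, 50 + 50)     # 70/100  = 0.70
print("combined A:", a_total, "B:", b_total)
```

A wins in the younger group (0.90 vs 0.60) and B wins in the older group (0.80 vs 0.20), yet B wins in the pooled totals, because A was tested mostly on the harder-to-treat older patients.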
Exploratory Data Analysis (EDA) is the initial phase of data analysis, where the goal is to discover patterns, spot anomalies, test hypotheses, and check assumptions through summary statistics and graphical representations. It's about understanding the data's structure and extracting insights without prior hypotheses.
Explanatory Data Analysis, on the other hand, focuses on explaining findings, validating models, and communicating results clearly to a target audience. It aims to convey the main characteristics and insights derived from the data, often after the exploratory phase has identified the key trends and relationships.
In this sub-assignment I want you to create a choropleth map of San Francisco, showing where it's safest to leave your car on Sundays, following the instructions from Week 3.
When you're done, reflect on the following questions.
Based on your map and analysis, where should you park the car for it to be safest on a Sunday? And where's the worst place?
On the map, the districts are shaded from light red to deep red, with deeper red indicating more vehicle theft cases and therefore lower relative safety for parking. TENDERLOIN, shown in the lightest shade of red, is the safest district to park in, while INGLESIDE, depicted in deep red, is the most dangerous.
Using visualizations can help us uncover powerful data patterns. However, when designing visualizations, we need to be aware of several illusions that can lead viewers to misinterpret the data we are showing (i.e. perceptual errors):
Yes. Widening the color-scale range on the right map from 1-1,600 to 1-10,000 diminishes the ability to distinguish the districts' safety levels: all districts appear similarly risky, because the broader scale minimizes the visual differences in vehicle theft counts across districts.
Perceptual errors in statistics are misinterpretations of statistical data caused by the way the information is presented or by cognitive biases. They can significantly distort the conclusions drawn from data and lead to incorrect decisions or judgments, for several reasons.

First, improperly scaled graphs distort the view of the data. In the example above, simply enlarging the color scale of the right map makes it appear that there is no difference between districts, since they all display roughly the same color. Conversely, shrinking the scale may cause needless alarm, as many districts then appear deep red.

Second, perceptual errors feed confirmation bias: observers accept figures that support their preconceived notions while dismissing evidence that contradicts them. For example, a person who believes a district is very dangerous may prefer to believe it has the highest number of all sorts of crimes, even though the statistics may show the district ranks high only in some crime categories.

Third, there is base rate neglect, where viewers dismiss or undervalue the base rate of an event. Individuals viewing the map might be deterred from parking in INGLESIDE because it is drawn in the deepest red, yet the overall prevalence of vehicle theft remains relatively low. Moreover, this map aggregates all thefts from 2003 to 2018, so the colors reflect cumulative totals across districts rather than daily average numbers of vehicle theft incidents.
In this data error, we see that a lot of prostitution arrests apparently happen on Thursdays. At least, that is when the police officers who write the reports have recorded the crime as having happened. This biases the binned data in a number of ways. Most importantly, the data is now skewed towards Thursday being the day most prone to prostitution cases. Any aggregated statistic, such as the average number of prostitution arrests per weekday, will reflect this, since Thursdays are overrepresented in the data. A single week with an abnormal number of prostitution arrests could, for some reason, severely skew the rest of the data towards Thursdays.
It gives the skewed impression that Thursdays are when most prostitution cases happen, which might cause the police to increase their Thursday patrols looking for prostitution. That could have a recursive effect: more arrests for prostitution on Thursdays, looping back into our models showing Thursdays as the most "active" day. It can also cause incorrect assumptions both within the police and among researchers like us, who, while modelling the data, might not notice the error and conclude that Thursdays are when the police should be patrolling.
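One way to catch this kind of artifact before modelling is a simple screen for weekdays whose share of reports far exceeds the uniform expectation of 1/7. A minimal sketch with hypothetical counts containing a Thursday spike (the numbers are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical weekday counts mimicking the Thursday reporting artifact
counts = pd.Series({"Monday": 510, "Tuesday": 495, "Wednesday": 505,
                    "Thursday": 910, "Friday": 520, "Saturday": 500, "Sunday": 490})

# Flag any day whose share exceeds the uniform expectation (1/7) by more than 50%
share = counts / counts.sum()
flagged = share[share > 1.5 / 7]
print(flagged)  # only Thursday is flagged here
```

A flagged day does not prove a data error, but it is a prompt to check how the reports were recorded before trusting the weekday distribution.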
Here we see a jitter plot of the times at which crimes are reported, restricted to the hour between 19:00 and 20:00. The times are often clearly rounded to the nearest whole or half hour. This likely happens because the officer making the report did not know exactly when the crime happened and made an estimate, but it skews the data towards the start, middle, and end of each hour and underrepresents the actual times at which crimes occur. When the data is binned, the peaks fall on the rounded times, so times of "high" crime are artificially created, since the times are mostly estimates.
For modelling and data visualization, this can give an incorrect picture of real-world events in San Francisco. Most crimes don't happen exactly at 19:00 or 20:00; they are much more likely to happen at a time like 19:03, 19:27, or 19:46. This misrepresents the actual crime patterns in the city, so any crime prevention policies based on models of this data will most likely overemphasize the times at the beginning or end of the hour.
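A quick diagnostic for this rounding artifact is the distribution of the minute component: if report times were exact, only about 2 of 60 minutes (roughly 3%) would land on :00 or :30. A minimal sketch with hypothetical report times (illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical report times: many rounded to :00 / :30, a few precise
times = pd.Series(["19:00", "19:00", "19:30", "19:00", "19:27",
                   "19:30", "19:46", "19:00", "19:30", "19:00"])
minutes = pd.to_datetime(times, format="%H:%M").dt.minute

# Share of reports landing exactly on a whole or half hour
rounded_share = minutes.isin([0, 30]).mean()
print(rounded_share)  # 0.8 here, far above the ~0.033 expected for uniform minutes
```

A share anywhere near this high signals that the recorded minutes are estimates, and that minute-level binning of the real data should be treated with caution.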
The data here shows a hot spot of crime around the Hall of Justice in San Francisco. This is most likely because it is a landmark most police officers in San Francisco know. So when an officer doesn't quite remember where an arrest happened, but knows it was near the Hall of Justice, perhaps on the same street, they put that down as the address. But this is misleading, since it shows the Hall of Justice as a sex offence hot spot, which seems unlikely to be the truth. Any aggregated crime statistics for that area will therefore show a large number of crimes around the Hall of Justice.
Since the data is now skewed towards the Hall of Justice having many sex offences, it skews the area around it as well. Our models will show this area as having a lot of arrests, which might lead to the police or local government make decisions such as increased patrols or allocation of more resources to the police in that area, which would be misguided.
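A simple way to screen for such a geocoding artifact is to count how many records share an identical coordinate pair; a single point holding a disproportionate share of incidents is suspect. A minimal sketch with hypothetical coordinates (the values are illustrative, not the actual Hall of Justice location):

```python
import pandas as pd

# Hypothetical incident coordinates; one pair repeats as a "default" geocode
df_geo = pd.DataFrame({
    "X": [-122.4036, -122.4194, -122.4036, -122.4036, -122.4313],
    "Y": [37.7753, 37.7749, 37.7753, 37.7753, 37.7739],
})

# Count records per exact (X, Y) pair, most frequent first
top = df_geo.groupby(["X", "Y"]).size().sort_values(ascending=False)
print(top.head(1))  # the suspect hotspot coordinate
```

On real data, a coordinate matching a police landmark and holding many times its neighbours' counts would support the default-address explanation.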
The sheer amount of larceny theft that is reported by the police in San Francisco could be the result of human error. Some of the theft might instead be robbery, burglary or any of the other theft related crimes. If we look at the boxplots for each of the categories it becomes very clear:
```python
focuscrimes = set(['WEAPON LAWS', 'PROSTITUTION', 'DRIVING UNDER THE INFLUENCE', 'ROBBERY',
                   'BURGLARY', 'ASSAULT', 'DRUNKENNESS', 'DRUG/NARCOTIC', 'TRESPASS',
                   'LARCENY/THEFT', 'VANDALISM', 'VEHICLE THEFT', 'STOLEN PROPERTY',
                   'DISORDERLY CONDUCT'])
# .copy() avoids a SettingWithCopyWarning when columns are modified later
data = df[df["Category"].isin(focuscrimes)].copy()

crimes_per_day = data.groupby(['Category', 'Date']).size().reset_index(name='Incidents')

crimes_per_day.boxplot(by='Category', column='Incidents', figsize=(15, 15))
plt.ylabel("Incidents")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```
We can see that larceny/theft has so many reports each day that it dominates the whole plot. The same pattern is clearly visible across the whole dataset if we compare the category proportions within each district:
```python
districts = data["PdDistrict"].dropna().unique()
nrows, ncols = 5, 2

fig, axes = plt.subplots(nrows, ncols, figsize=(20, 20), sharey=True, sharex=True)
axes = axes.flatten()
for i, district in enumerate(districts):
    crime_data = data[data["PdDistrict"] == district]
    # normalize=True gives each category's share of the district's incidents
    crime_district_counts = crime_data["Category"].value_counts(normalize=True).sort_index()
    axes[i].bar(crime_district_counts.index, crime_district_counts.values)
    axes[i].set_title(district)
    axes[i].set_xlabel("Category")
    axes[i].set_ylabel("Proportion")
    axes[i].tick_params('x', labelrotation=90)
plt.tight_layout()
plt.show()
```
This may stem especially from the years 2015-2016, as seen in the calendar plot below:
```python
data["Date"] = pd.to_datetime(data["Date"], yearfirst=True)
data.set_index("Date", inplace=True)
category_data = data[data["Category"] == "LARCENY/THEFT"]
# Daily counts of larceny/theft incidents
category_counts = category_data.resample("D").size()

calplot.calplot(category_counts, cmap='YlGn', yearlabel_kws={'fontname': 'sans-serif'})
plt.show()
```
We found LLMs mostly useful as a guide. Most of the code we could write on our own, or using help from more traditional online resources. But especially when looking for examples of human noise, through conversations with the LLM and explaining our own thought process, we were able to come to a conclusion we were satisfied with. Often with the code, minor mistakes were what the LLM could help with, such as getting an understanding of specific parameters a function might need to achieve a certain look or result.
| Part | Contributor | Student ID |
|---|---|---|
| Part 1, Part 2 | Dimasha Nanayakkara | s194710 |
| Part 3 | Ning Sun | s232171 |
| Part 4 | Max Heiberg Bestle | s194574 |